Chapter 5 Parts-of-Speech Tagging
In many textual analyses, word classes can give us additional information about the text we analyze. These word classes are typically referred to as the parts-of-speech (POS) tags of the words. In this chapter, we will show you how to POS-tag a raw-text corpus to get the syntactic categories of words, and what to do with those POS tags.
In particular, I will introduce a powerful package, spacyr, which is an R wrapper around spaCy, the “industrial-strength natural language processing” Python library from https://spacy.io. In addition to POS tagging, the package provides other linguistically relevant annotations for more in-depth analysis of English texts.
Again, spaCy is optimized for many languages, but not for Chinese. We will talk about Chinese text processing in a later chapter.
5.1 Installing the Package
Please consult the spacyr GitHub page for more instructions on installing the package.
There are at least four steps:
- Install miniconda (or any other conda distribution for Python)
- Install the spacyr R package:
install.packages("spacyr")
- Because spacyr is an R wrapper to a Python package, you need to have Python installed on your system as well. The easiest way to install spaCy and spacyr is through the spacyr function spacy_install(). This function by default creates a new conda environment called spacy_condaenv, as long as some version of conda is installed on the user’s system. (spacyr uses Python 3.6.x and spaCy 2.2.3+)
library(spacyr)
spacy_install(version='2.2.3')
The spacy_install() function will create a stand-alone conda environment including a Python executable separate from your system Python (or Anaconda Python), install the latest version of spaCy (and its required packages), and download the English language model.
If you don’t have any conda version installed on your system, you can install miniconda from https://conda.io/miniconda.html. (Choose the 64-bit version, or alternatively, run to the computer store now and purchase a 64-bit system to replace your ancient 32-bit platform.) Also, spacy_install() will automatically install miniconda (if there’s no conda installed on the system) for Mac users. Windows users may need to consult the spacyr GitHub page for more important instructions on installation.
For Windows, you need to run R as an administrator to make installation work properly. To do so, right click the RStudio icon (or R desktop icon) and select “Run as administrator” when launching R.
- Initialize spaCy in R
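After installation, every new R session needs to launch the background spaCy (Python) process before any parsing can be done; a minimal sketch using spacyr's spacy_initialize():

```r
library(spacyr)
# launch the background Python process and load the English language model
spacy_initialize()
```

Calling this once at the start of a session is sufficient; the process stays alive until you finalize it (see Section 5.9).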
5.2 Quick Overview
The spacyr package provides a useful function, spacy_parse(), which allows us to parse an English text in a very convenient way.
txt <- c(d1 = "spaCy is great at fast natural language processing.",
d2 = "Mr. Smith spent two years in North Carolina.")
parsedtxt <- spacy_parse(txt,
                         pos = TRUE,
                         tag = TRUE,
                         lemma = TRUE,
                         entity = TRUE,
                         dependency = TRUE)
parsedtxt

The output parsedtxt is a data frame, which includes annotations of the original texts at multiple granularities.
- All texts have been tokenized into words, with each word, sentence, and text given a unique ID (i.e., doc_id, sentence_id, token_id)
- Lemmatization is also done (i.e., lemma)
- POS tags can also be found (i.e., pos and tag)
  - pos: this column uses the Universal tagset for parts-of-speech, a general POS scheme that suffices for most needs and provides equivalencies across languages
  - tag: this column provides a more detailed tagset, defined in each spaCy language model. For English, this is the OntoNotes 5 version of the Penn Treebank tag set (cf. Penn Treebank Tagset)
- Depending on the argument settings for spacy_parse(), you can get more annotations, such as named entities (entity) and dependency relations (dep_rel)
5.3 Working Pipeline
In Chapter 4, we provided a working pipeline for text analytics. Here we would like to revise the workflow to satisfy different goals in computational text analytics.
After we secure a collection of raw texts as our corpus, if we do not need additional parts-of-speech information, we follow the workflow on the right.
If we need additional annotations from spacyr, we follow the workflow on the left.
5.4 Parsing Your Text
Now let’s use this spacy_parse() to analyze the presidential addresses we’ve seen in Chapter 4: the data_corpus_inaugural from quanteda.
To illustrate the annotation more clearly, let’s parse the first text in data_corpus_inaugural:
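A sketch of parsing just the first document (assuming, as in quanteda, that a corpus can be subset like a vector, which here returns 1789-Washington):

```r
library(quanteda)
library(spacyr)
spacy_initialize()

# parse only the first document of the inaugural corpus
parsed_first <- spacy_parse(data_corpus_inaugural[1], pos = TRUE, tag = TRUE)
head(parsed_first)
```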
We can parse the whole corpus collection as well: we first apply spacy_parse() to each text in data_corpus_inaugural using map() and then bind the individual resulting data frames into one using do.call(rbind, ...).
system.time(
  corp_us_words <- data_corpus_inaugural %>%
    map(spacy_parse, tag = TRUE) %>%
    do.call(rbind, .))

##    user  system elapsed
##  14.458   1.221  15.681
Before we move on, we need to clean up the doc_id column of corp_us_words.
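One possible cleanup, assuming the rbind() step has left row names of the form "1789-Washington.1" (the document name plus a numeric suffix) while the doc_id column itself is uninformative. The helper below is a hedged sketch, not the book's own code:

```r
library(stringr)

# strip the trailing ".<digits>" from the row names to recover clean doc_id's
# (assumes row names look like "1789-Washington.1", "1789-Washington.2", ...)
clean_doc_id <- function(df) {
  df$doc_id <- str_replace(rownames(df), "\\.\\d+$", "")
  df
}

# usage (hypothetical): corp_us_words <- clean_doc_id(corp_us_words)
```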
Exercise 5.1 How can we combine the word tokens (with their POS tags) of each sentence in corp_us_words into one sentence-based data frame, as provided below?
5.5 Metalinguistic Analysis
Now spacy_parse() has enriched our corpus data with more linguistic annotations. We can now utilize the additional tags for more analysis.
In many applied linguistics studies, people sometimes look at the syntactic complexity of language across a particular factor. For example, people may look at the syntactic complexity development of L2 learners of varying proficiency levels, of L1 speakers in different acquisition stages, or of writers in different genres (e.g., academic vs. non-academic).
To operationalize the construct of syntactic complexity, we use a simple metric, Fichtner's C, which is defined as:
\[ Fichtner's C = \frac{Number\;of\;Verbs}{Number\;of\;Sentences} \times \frac{Number\;of\;Words}{Number\;of\;Sentences} \]
Now we can take corp_us_words and first generate the frequencies of verbs, the number of sentences, and the number of words for each presidential speech text.
syn_com <- corp_us_words %>%
group_by(doc_id) %>%
summarize(verb_num = sum(pos=="VERB"),
sent_num = max(sentence_id),
word_num = n()) %>%
mutate(F_C = (verb_num/sent_num)*(word_num/sent_num)) %>%
  ungroup()
syn_com

With the syntactic complexity of each president, we can plot the tendency:
syn_com %>%
ggplot(aes(x = doc_id, y = F_C, fill = doc_id)) +
geom_col() +
theme(axis.text.x = element_text(angle=90)) +
  labs(title = "Syntactic Complexity", x = "Presidents", y = "Fichtner's C") +
  guides(fill = FALSE)
It’s interesting to see a decreasing trend in syntactic complexity!

5.6 Construction Analysis
Now with parts-of-speech tags, we are able to look at more linguistic patterns or constructions in detail. These POS tags allow us to extract more precisely the target patterns we are interested in.
In this section, we will use the output from Exercise 5.1. We assume that now we have a sentence-based corpus data frame, corp_us_sents. Here I would like to provide a case study on English Preposition Phrases.
We can utilize regular expressions to extract PREP + NOUN combinations from the corpus data.
# define regex patterns
pattern_pat1 <- "[^/ ]+/ADP [^/]+/NOUN"
# extract patterns from corp
corp_us_sents %>%
unnest_tokens(output = pat_pp,
input = sentence_tag,
token = function(x) str_extract_all(x, pattern=pattern_pat1)) -> result_pat1
result_pat1

In the above example, we specify the token= argument in unnest_tokens() with a self-defined function. The idea of tokenization in unnest_tokens() is that the token argument should be a function which takes a text-based vector as input (i.e., each element of the input vector may be a document text) and returns a list, each element of which is a token-based version (i.e., a vector) of the original input vector element (cf. Figure 5.1).
Figure 5.1: Intuition for token= in unnest_tokens()
In our demonstration, we define a tokenization function which takes sentence_tag as the input and returns a list, each element of which consists of a vector of tokens matching the regular expression in the individual sentences of sentence_tag. (Note: The function object is not assigned to a name, so it is never created as an object in the R working session.)
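To see what this tokenizer returns, we can run the regular expression on a toy, hypothetical tagged sentence (note that the pattern only catches a noun immediately following the preposition):

```r
library(stringr)

# a word tagged /ADP immediately followed by a word tagged /NOUN
s <- "he/PRON stayed/VERB in/ADP office/NOUN for/ADP years/NOUN"
str_extract_all(s, "[^/ ]+/ADP [^/]+/NOUN")
# matches: "in/ADP office/NOUN" and "for/ADP years/NOUN"
```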
(As an exercise, please create a new column, pat_clean, with all POS annotations removed, in the data frame result_pat1.)
With these constructional tokens of English PP’s, we can then do further analysis.
- We first identify the PREP and NOUN for each constructional token.
- We then clean up the data by removing POS annotations.
# extract the prep and head
result_pat1 %>%
tidyr::separate(col="pat_pp", into=c("PREP","NOUN"), sep="\\s+" ) %>%
mutate(PREP = str_replace_all(PREP, "/[^ ]+",""),
NOUN = str_replace_all(NOUN, "/[^ ]+","")) -> result_pat1a
result_pat1a

Now we are ready to explore the text data.
- We can look at how each preposition is being used by different presidents:
- We can examine the most frequent NOUN that co-occurs with each PREP:
# Most freq NOUN for each PREP
result_pat1a %>%
count(PREP, NOUN) %>%
group_by(PREP) %>%
top_n(1,n) %>%
  arrange(desc(n))

- We can also look at a more complex usage pattern: how does each president use the PREP of in terms of its co-occurring NOUNs?
# NOUNS for `of` uses across different presidents
result_pat1a %>%
filter(PREP == "of") %>%
count(doc_id, PREP, NOUN) %>%
tidyr::pivot_wider(
id_cols = c("doc_id"),
names_from = "NOUN",
values_from = "n",
    values_fill = list(n=0))

As an exercise (used in Section 5.7 below), please extract full English PP's from corp_us_sents. Specifically, we can define an English PP as a sequence of words which starts with a preposition and ends at the first word after the preposition that is tagged as NOUN, PROPN, or PRON.
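One possible (and admittedly imperfect) regular expression for this PP definition, assuming uppercase tags as in pattern_pat1: a lazy quantifier lets any tagged words intervene but stops at the first NOUN, PROPN, or PRON after the preposition. A sketch, not the book's own solution:

```r
library(stringr)

# ADP + shortest run of intervening tagged words + first NOUN/PROPN/PRON
pattern_pat2 <- "[^/ ]+/ADP( [^/ ]+/[A-Z]+)*? [^/ ]+/(NOUN|PROPN|PRON)"

# toy check on a hypothetical tagged fragment
str_extract_all("of/ADP this/DET distinguished/ADJ honor/NOUN", pattern_pat2)
```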
5.7 Issues on Pattern Retrieval
Any automatic pattern retrieval comes with a price: there are always errors returned by the system.
I would like to discuss this issue based on the second text, 1793-Washington. First let's take a look at the Preposition Phrases extracted by the regular expression used in Exercises 5.4 and 5.5:
## If you haven't finished the exercise, the dataset is also available in `demo_data/result_pat2a.RDS`
# result_pat2a <- readRDS("demo_data/result_pat2a.RDS") # uncomment this line if you don't have `result_pat2a`
result_pat2a %>%
  filter(doc_id == "1793-Washington")

My regular expression has identified 20 PP's from the text. However, if we go through the text carefully and do the PP annotation manually, we may get different results.
Figure 5.2: Manual Annotation of English PP’s in 1793-Washington
There are two types of errors:
- False Positives: Patterns identified by the system that are in fact not true patterns.
- False Negatives: True patterns in the data that are not successfully identified by the system.
As shown in Figure 5.2, manual annotations have identified 21 PP’s from the text while the regular expression identified 20 tokens. A comparison of the two results shows that:
- In the above regex result, the following returned tokens (rows highlighted) are false, i.e., False Positives.
| doc_id | sentence_id | PREP | NOUN | pat_pp | row_id |
|---|---|---|---|---|---|
| 1793-Washington | 1 | by | voice | by/adp the/det voice/noun | 1 |
| 1793-Washington | 1 | of | country | of/adp my/det country/noun | 2 |
| 1793-Washington | 1 | of | chief | of/adp its/det chief/propn | 3 |
| 1793-Washington | 2 | for | it | for/adp it/pron | 4 |
| 1793-Washington | 2 | of | honor | of/adp this/det distinguished/adj honor/noun | 5 |
| 1793-Washington | 2 | of | confidence | of/adp the/det confidence/noun | 6 |
| 1793-Washington | 2 | in | me | in/adp me/pron | 7 |
| 1793-Washington | 2 | by | people | by/adp the/det people/noun | 8 |
| 1793-Washington | 2 | of | united | of/adp united/propn | 9 |
| 1793-Washington | 3 | to | execution | to/adp the/det execution/noun | 10 |
| 1793-Washington | 3 | of | act | of/adp any/det official/adj act/noun | 11 |
| 1793-Washington | 3 | of | president | of/adp the/det president/propn | 12 |
| 1793-Washington | 3 | of | office | of/adp office/noun | 13 |
| 1793-Washington | 4 | in | presence | in/adp your/det presence/noun | 14 |
| 1793-Washington | 4 | during | administration | during/adp my/det administration/noun | 15 |
| 1793-Washington | 4 | of | government | of/adp the/det government/propn | 16 |
| 1793-Washington | 4 | in | instance | in/adp any/det instance/noun | 17 |
| 1793-Washington | 5 | to | upbraidings | to/adp the/det upbraidings/noun | 18 |
| 1793-Washington | 5 | of | who | of/adp all/det who/pron | 19 |
| 1793-Washington | 5 | of | ceremony | of/adp the/det present/adj solemn/adj ceremony/noun | 20 |
- In the above manual annotation (Figure 5.2), phrases highlighted in red are NOT successfully identified by the current regex query, i.e., False Negatives.
We can summarize the pattern retrieval results as:

|  | Identified by the regex | Not identified |
|---|---|---|
| True PP's | 18 (True Positives) | 3 (False Negatives) |
| Non-PP's | 2 (False Positives) | — |
Most importantly, we can describe the quality of the pattern retrieval with two important measures.
- \(Precision = \frac{True\;Positives}{True\;Positives + False\;Positives}\)
- \(Recall = \frac{True\;Positives}{True\;Positives + False\;Negatives}\)
In our case:
- \(Precision = \frac{18}{18+2} = 90\%\)
- \(Recall = \frac{18}{18+3} = 85.71\%\)
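These two measures can be computed directly from the counts above:

```r
TP <- 18  # true PP's returned by the regex
FP <- 2   # returned tokens that are not true PP's
FN <- 3   # true PP's the regex missed

precision <- TP / (TP + FP)  # 0.9
recall    <- TP / (TP + FN)  # 18/21, about 0.857
```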
It is always very difficult to reach 100% precision or 100% recall for automatic retrieval of the target patterns. Researchers often need to make a compromise. The following are some heuristics based on my experiences:
- For small datasets, probably manual annotations give the best result.
- For moderate-sized dataset, semi-automatic annotations may help. Do the automatic annotations first and follow up with manual checkups.
- For large datasets, automatic annotations are preferred in order to examine the general tendency. However, it is always good to have a random sample of the data to check the query performance.
- The more semantics-related the annotations, the more likely one would adopt a manual approach to annotation (e.g., conceptual metaphors, sense distinctions, dialogue acts).
- Common annotations of corpus data may prefer an automatic approach, such as Chinese word segmentation, POS tagging, named entity recognition, chunking, noun-phrase extractions, or dependency relations(?).
5.8 Saving POS-tagged Texts
We may very often come back to our corpus texts again and again when we explore the data. In order NOT to re-tag the texts every time we process the data, it is more convenient to save the tokenized texts with their POS tags on the hard drive. Next time we can import those files without going through the POS-tagging again.
However, when saving the POS-tagged results to an external file, it is highly recommended to keep all the tokens of the original texts. That is, leave all the word tokens as well as the non-word tokens intact.
A few suggestions:
- If you are dealing with a small corpus, I would probably suggest you save the resulting data frame from spacy_parse() as a csv for later use.
- If you are dealing with a big corpus, I would probably suggest you save the parsed output of each text file as a separate csv for later use.
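For the small-corpus case, a minimal sketch (the file name is hypothetical, and corp_us_words is the parsed data frame from Section 5.4):

```r
# save the POS-tagged data frame once
write.csv(corp_us_words, "corp_us_words_parsed.csv", row.names = FALSE)

# in a later session, re-import without re-tagging
corp_us_words <- read.csv("corp_us_words_parsed.csv", stringsAsFactors = FALSE)
```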
5.9 Finalize spaCy
While running spaCy on Python through R, a Python process is always running in the background, and the R session will take up a lot of memory (typically over 1.5GB). spacy_finalize() terminates the Python process and frees up the memory it was using.
Exercise 5.6 In this exercise, please use the corpus data provided in quanteda.textmodels::data_corpus_moviereviews. This dataset is provided as a corpus object in the package quanteda.textmodels (please install the package on your own). The data_corpus_moviereviews includes 2,000 movie reviews.
- Please use spacyr to parse the texts and provide the top 20 adjectives for positive and negative reviews respectively. Adjectives are naively defined as any words whose pos tags start with “J”. When computing the word frequencies, please use the lemmas instead of the word forms.
- Please provide the top 20 content words for positive and negative reviews ranked by a weighted score, which is computed using the formula provided below. Content words are naively defined as any words whose pos tags start with N, V, or J.
\[Word\;Frequency \times log(\frac{Number\;of\;Documents}{Word\;Dispersion}) \]
- For example, suppose the lemma action occurs 691 times in the negative reviews collection, and these occurrences are scattered across 337 different documents. Then the weighted score for action is:
\[691 \times log(\frac{1000}{337}) = 751.58 \]
In our earlier chapters, we have discussed the issues of word frequencies and their significance in relation to the dispersion of the words in the entire corpus. In terms of identifying important words from a text collection, our assumption is that if a word is scattered across almost every document in the corpus collection, it is probably less informative. For example, words like a and the would probably be observed in all documents included in the corpus. Therefore, the high frequencies of these widely-dispersed words may not be as important as the high frequencies of those which occur in only a subset of the corpus collection. The word frequency is sometimes referred to as term frequency (tf) in information retrieval; the dispersion of the word is referred to as document frequency (df). In information retrieval, people often use a weighting scheme for word frequencies in order to extract informative words from the text collection. The scheme is as follows:
\[tf \times log(\frac{N}{df}) \]
N refers to the total number of documents in the corpus. The \(log\frac{N}{df}\) is referred to as the inverse document frequency (idf). This tf-idf weighting scheme is popular in many practical applications.
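The weighting scheme can be sketched with dplyr on a toy token table (the column names doc_id and lemma follow the spacy_parse() output; the data themselves are made up):

```r
library(dplyr)

# toy token table: two tiny documents
tokens <- tibble::tibble(
  doc_id = c("d1", "d1", "d1", "d2", "d2"),
  lemma  = c("the", "film", "the", "the", "action")
)

N <- n_distinct(tokens$doc_id)  # total number of documents

tokens %>%
  group_by(lemma) %>%
  summarize(tf = n(),                    # term frequency
            df = n_distinct(doc_id)) %>% # document frequency (dispersion)
  mutate(weight = tf * log(N / df)) %>%  # tf * log(N/df)
  arrange(desc(weight))
# "the" occurs in every document, so its weight is 0 despite being most frequent
```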